The Allotrope Data Format (ADF) [[!ADF]] consists of several APIs and taxonomies. This document constitutes the specification of the ADF Check Sum Computation API for creating hash codes on an ADF file, or parts thereof.
THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This document is part of a set of specifications on the Allotrope Data Format [[!ADF]].
An underlying design principle of the ADF check sum is that a local change (e.g., in a part of a data cube) must not require re-reading the entire file to compute the check sum.
In principle, ADF is designed to support a choice between multiple algorithms to compute the check sum. Currently, there is only one implementation, which depends on the internal representation of the storage as an HDF5 file. This document describes this algorithm and its application to various component parts of the ADF file. The document is structured as follows: first, the prerequisites are defined, followed by a detailed description of the hash computation on different parts of an ADF file.
Within this specification, the following namespace prefix bindings are used:
Prefix | Namespace |
---|---|
owl: | http://www.w3.org/2002/07/owl# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs: | http://www.w3.org/2000/01/rdf-schema# |
xsd: | http://www.w3.org/2001/XMLSchema# |
dct: | http://purl.org/dc/terms/ |
skos: | http://www.w3.org/2004/02/skos/core# |
qb: | http://purl.org/linked-data/cube# |
adf-dp: | http://purl.allotrope.org/ontologies/datapackage# |
adf-dc: | http://purl.allotrope.org/ontologies/datacube# |
adf-dc-hdf: | http://purl.allotrope.org/ontologies/datacube-to-hdf5-map# |
adf-audit: | http://purl.allotrope.org/ontologies/audit# |
lc-hash: | http://id.loc.gov/vocabulary/cryptographicHashFunctions/ |
Within this document, decimal numbers will use a dot "." as the decimal mark.
In the following, we often need to represent lists of bytes. For each byte, we use two hexadecimal digits; for several bytes, we simply append those digits.
We use the terms "check sum" and "hash" synonymously.
When converting data to byte strings, different data types must be handled separately. In the rest of the document, we always specify which data type is used when an object is converted to bytes.
The integer data types are written in two's complement representation with "big endian" byte order.
The table below shows an example for each type:
Java data type | C# data type | Example value | Bytes |
---|---|---|---|
byte | sbyte | 45 | 2d |
short | short | -7498 | e2b6 |
int | int | 1318336784 | 4e943910 |
long | long | -4895739457839457 | ffee9b59d4b3de9f |
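As an illustration (not part of the specification), the following Java sketch reproduces the byte strings from the table above; java.nio.ByteBuffer writes integral values in two's complement with big-endian byte order by default.

```java
import java.nio.ByteBuffer;

public class IntegerEncodingExample {
    // Two hexadecimal digits per byte, appended as described above.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(new byte[] { 45 }));                                           // 2d
        System.out.println(hex(ByteBuffer.allocate(2).putShort((short) -7498).array()));      // e2b6
        System.out.println(hex(ByteBuffer.allocate(4).putInt(1318336784).array()));           // 4e943910
        System.out.println(hex(ByteBuffer.allocate(8).putLong(-4895739457839457L).array()));  // ffee9b59d4b3de9f
    }
}
```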
The floating point data types (float and double) are written according to IEEE 754 with “big endian” byte order.
Java data type | C# data type | Example value | Bytes |
---|---|---|---|
float | float | -16e10 | d21502f9 |
double | double | 3254e43 | 4996cc9385c1f043 |
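Analogously, a short Java sketch (for illustration only) reproduces the floating point examples; ByteBuffer writes IEEE 754 values in big-endian byte order by default.

```java
import java.nio.ByteBuffer;

public class FloatEncodingExample {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(ByteBuffer.allocate(4).putFloat(-16e10f).array()));   // d21502f9
        System.out.println(hex(ByteBuffer.allocate(8).putDouble(3254e43).array()));  // 4996cc9385c1f043
    }
}
```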
We encode strings by first writing the number of characters of that string as int (see above), followed by the string itself encoded as UTF-8.
Example: "Hällo World!" (umlaut ä instead of e is intended for the example) is encoded to the bytes 0000000c48c3a46c6c6f20576f726c6421. The first four bytes, 0000000c, encode the number of characters (12), and the next 13 bytes are the UTF-8 encoding of the string. Note that the umlaut ä is encoded to 2 bytes (c3a4) after the H (48), which leads to one byte more than the length of the string.
ADF uses existing message digest algorithms to convert a stream of bytes to a check sum.
Currently supported are MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512. The checksums in ADF (in general and for the DataPackage) are by default intended to ensure the integrity of the ADF file against accidental storage or transfer errors. To do this while providing the best performance, the default algorithm is set to MD5. This algorithm is no longer considered cryptographically secure, but for the intended purpose it is one of the best candidates. However, users can choose a configuration with one of the other algorithms. The checksum service can be configured per ADF file when activating hashing for the file. For the DataPackage, the algorithm can be set individually for each DataPackage file created.
When we state that data is added to a message digest, we mean that the data is converted to bytes (by the rules above), and then the resulting bytes are sequentially added to the digest algorithm. This must yield the same result as constructing an array of bytes that is filled by the outcomes of the conversions and then running the message digest algorithm on it.
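For illustration, the following Java sketch (using the standard MessageDigest API; not mandated by this specification) shows that adding data sequentially to a digest yields the same result as digesting the concatenated byte array:

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestEquivalenceExample {
    public static void main(String[] args) throws Exception {
        byte[] part1 = { 0x00, 0x00, 0x00, 0x0c };
        byte[] part2 = { 0x48, (byte) 0xc3, (byte) 0xa4 };

        // Adding the byte arrays sequentially ...
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(part1);
        md.update(part2);
        byte[] sequential = md.digest();

        // ... must yield the same result as digesting the concatenated array.
        byte[] concatenated = new byte[part1.length + part2.length];
        System.arraycopy(part1, 0, concatenated, 0, part1.length);
        System.arraycopy(part2, 0, concatenated, part1.length, part2.length);
        byte[] oneShot = MessageDigest.getInstance("SHA-256").digest(concatenated);

        System.out.println(Arrays.equals(sequential, oneShot)); // true
    }
}
```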
We use a hierarchical approach to combine the hash values of several parts of the ADF file.
In ADF, we use this mechanism at several levels to minimize computation costs.
For the sake of brevity, we use a fictitious digest algorithm, SHA-32, for examples, where we just take the first 4 bytes of the SHA-256 algorithm’s result.
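The sketch below (illustrative only, with made-up part contents) shows the fictitious SHA-32 and the hierarchical combination: the hash values of the parts, rather than their raw data, are fed into the message digest that produces the overall hash, so the hash of an unchanged part can be reused.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class HierarchicalHashExample {
    // Fictitious "SHA-32": the first 4 bytes of a SHA-256 digest.
    static byte[] sha32(byte[] data) throws NoSuchAlgorithmException {
        return Arrays.copyOf(MessageDigest.getInstance("SHA-256").digest(data), 4);
    }

    // The overall hash is computed over the part hashes, not over the parts' raw data.
    static byte[] combine(byte[]... partHashes) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] partHash : partHashes) md.update(partHash);
        return Arrays.copyOf(md.digest(), 4);
    }

    public static void main(String[] args) throws Exception {
        byte[] hashA = sha32("content of part A".getBytes(StandardCharsets.UTF_8));
        byte[] hashB = sha32("content of part B".getBytes(StandardCharsets.UTF_8));
        // If only part B changes, hashA can be reused when recomputing the overall hash.
        byte[] overall = combine(hashA, hashB);
        System.out.println(overall.length); // 4
    }
}
```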
In an ADF file's meta model, the file itself is represented by a resource ?F, which is of type hdf:File. If hashing is enabled, the meta-data model will contain the statement

?F adf-audit:hasDigestMethod [ a adf-audit:DigestMethod; adf-audit:hasCanonicalizationAlgorithm adf-audit:c14n-adf-hdf-1.0 ; adf-audit:hasDigestAlgorithm lc-hash:sha256 ]

where adf-audit:c14n-adf-hdf-1.0 refers to the chosen algorithm and lc-hash:sha256 to the message digest algorithm used to calculate a digest from a byte stream.
Currently, there are two implementations of an ADF hash algorithm, of which one is only available for read-only access to older files to support backward compatibility. The only currently available choice when initializing check sums for a file is presented below. The configuration will allow the use of other implementations in the future.
Valid options for the message digest algorithm are:
lc-hash:md2
lc-hash:md5
lc-hash:sha1
lc-hash:sha256
lc-hash:sha384
lc-hash:sha512
The file's hash value is stored in the HDF5 root group's attribute ADF_CHECKSUM, encoded as a hexadecimal string.
Specification of the algorithm ADF-HDF-2.0
The algorithm's URI is adf-audit:c14n-adf-hdf-2.0. This is the current default implementation for the check sums.
The algorithm is specific to the usage of HDF5 as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.
The computation of the hash value is based on the structure of the ADF file's HDF5 representation. We define the computation of hashes for HDF5 datasets and HDF5 groups. The overall hash value of the file is defined as the HDF5 root group's hash value.
For each HDF5 group or dataset, the computed check sum is stored in an attribute ADF_CHECKSUM. This allows re-computing the complete check sum when only a part of the file has been changed. During verification, it allows detecting in which part of the file data has been corrupted.
Computation of HDF5 group hash values
The hash value of an HDF5 group is based on its attributes and its child elements (other groups and datasets). The following values are added to the message digest algorithm:
- The value of each of the group's attributes is added. Attributes of type H5T.INTEGER whose size is smaller than 4, or whose size is 4 and whose sign is not 0, are encoded as integers. Attributes of type H5T.INTEGER which do not fulfill the conditions of the previous item and whose size is smaller than 8, or whose size is 8 and whose sign is not 0, are encoded as longs. Attributes of type H5T.STRING are encoded as UTF-8 strings.
- The hash value of each of the group's child elements is added.
Excluded attributes
Attributes with the following names are not included in any hash:
- ADF_CHECKSUM
- checksum-adf-hdf-2.0
- adf-hdf-checksum-algorithm
Excluded groups
The group /check-sums and any sub-group of it will not be included in the computation of the hash value.
Computation of HDF5 dataset hash values
For every non-scalar data set, a check sum data set with the same number of dimensions is generated. (A scalar dataset is one without any dimensions; such a dataset contains just a single value.) It has the same path and name as the original dataset, but is located under the HDF5 group /check-sums.
For every dimension i, a number hashblock_i is chosen that describes the number of elements that are grouped together in that dimension to compute a single hash value. We denote the size of the data set in dimension i with size_i. In the hash data set, we store a check sum for each such group of elements. Each dimension of the hash data set is of size hashsize_i = size_i / hashblock_i, where we round up (hashsize_i = ceil(size_i / hashblock_i)). How hashblock_i is chosen is up to the implementation.
The chosen values are stored in the HDF5 attribute hash_block_size as a string which contains the comma-separated block sizes. For example, with block sizes 4 and 3 in the two dimensions of a two-dimensional data set, hash_block_size contains the string 4,3.
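As an illustration of the block layout, the following sketch (not part of the specification) assumes a hypothetical two-dimensional 10 × 7 data set with hash block sizes 4 and 3 and computes the dimensions of the hash data set and the hash_block_size attribute value:

```java
import java.util.Arrays;

public class HashBlockExample {
    public static void main(String[] args) {
        long[] size = { 10, 7 };      // size_i: extent of the data set per dimension (assumed example)
        long[] hashBlock = { 4, 3 };  // hashblock_i: elements grouped per hash value (assumed example)

        long[] hashSize = new long[size.length];
        StringBuilder attr = new StringBuilder();
        for (int i = 0; i < size.length; i++) {
            // hashsize_i = ceil(size_i / hashblock_i)
            hashSize[i] = (size[i] + hashBlock[i] - 1) / hashBlock[i];
            if (i > 0) attr.append(',');
            attr.append(hashBlock[i]);
        }
        System.out.println(Arrays.toString(hashSize)); // [3, 3]
        System.out.println(attr);                      // 4,3
    }
}
```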
Each hash value of a block is the result of computing the message digest of the following values:
The data set used to store the check sums is of data type byte; the last dimension is multiplied by the length of the resulting digest (e.g. 32 for SHA-256) to store all bytes of the hash in the data set.
The overall check sum for the data set is computed by adding the following data to the message digest algorithm:
Specification of the algorithm ADF-HDF-1.0
The algorithm's URI is adf-audit:c14n-adf-hdf-1.0.
Please note: This algorithm is deprecated and only documented here for reference to support older ADF files. Current implementations do not support writing files with it, and future versions might drop support completely.
The algorithm is specific to the usage of HDF5 as the underlying file
format and internal representation of the data structure.
In particular, the serialization of the data description is not deterministic
in the sense that the same content of the data description might result
in different binary representations in the file, depending on the order
of additions or removals. To compute the hash code, a hash code for separate parts of the ADF file
is computed.
The hash value of each part is added to the input of the message digest that computes the overall check sum.
Computation of the data cube part
The overall hash for the data cubes is computed by adding the hash value of each data cube to the message digest algorithm.
Information on the hash is stored in the meta-data for the cube ?Cube with

?Cube adf-audit:hasDigest [
    adf-audit:hasDigestMethod lc-hash:sha256;
    adf-audit:digestValue "..."^^xsd:base64Binary
]

where "..." contains the base-64-encoded hash value of the data cube.
Computation of a single data cube's hash
To compute the hash of a data cube, we first determine from the meta-data the measure datasets and scale datasets that are used to store the content of the data cube. The RDF resources representing the data sets are sorted alphabetically by their URI. In this order, the hash value of each data set is added to the message digest. How the hash value of a data set is computed is described below.
If the data cube contains strings or IRIs in its measurements or scales, the cube contains a dictionary holding a representation of the strings. If a dictionary is present for the data cube, all the data sets of the dictionary are sorted alphabetically by the URI of the RDF resources that represent them. For these datasets, additional attributes of the hash data set are included in the hash value, see below.
Computation of the data package part
For computing the hash value of the data package, we only consider the data sets of the files, not the directory layout and other meta-data like the time of last change. The directory layout and meta-data are stored in the technical model of the ADF file, whose hash computation is described below. Each file in the data package is backed by an HDF5 dataset.
First, we take the RDF representation of all files (resources whose type is adf-dp:File) and sort them by their URI. For every file, we add:
Computation of the data description part
Several HDF5 datasets are used to store the information. Their hash values are added to the message digest for the computation of the overall data description hash.
The following table contains the data sets whose hash values are added, with attribute names that must be included in the hash (see below). The path given in the table is relative to the HDF5 group /data-description, e.g. /data-description/dictionary/bytes.

Path | Attributes | Include size attribute of group |
---|---|---|
dictionary/bytes | nextId | yes |
dictionary/keys | nextId | no |
dictionary/nodes | nextId | no |
nodes_GSPO/nodes | nextId | yes |
nodes_GPOS/nodes | nextId | yes |
nodes_GOSP/nodes | nextId | yes |
nodes_SPOG/nodes | nextId | yes |
nodes_POSG/nodes | nextId | yes |
nodes_OSPG/nodes | nextId | yes |
quads | nextId, size | no |

For each data set in the table, the following data is added to the message digest:
The hash values of the data sets are not stored in the data description itself. They are stored as a hexadecimal string in the HDF5 attribute "CHECKSUM" instead.
Computation of the audit part
Like the data description, the audit store contains RDF graphs. To compute the audit's check sum, we first add the dictionary's datasets' check sums analogously to the data description, only the path is relative to /audit-trail. Then, all data sets that are included in any audit trail are sorted alphabetically by their URI and, for each data set:
Computing Check Sums of Data Sets
In the current version of ADF, the file format HDF5 is used as an underlying storage solution.
Most of the information stored in an ADF file is stored in HDF5 datasets.
This section describes how the check sum of a data set is computed. Because data sets can contain large amounts of data, we again apply the approach of combining multiple hashes into one overall hash. A data set has n dimensions and a data type; each entry of the data set belongs to that data type.
In ADF, the data types byte, short, int, long, float and double can be used. When we use coordinates of elements of a data set in the following, we start with 0 (not 1).
Constructing a Data Set for Check Sums
For this version of the algorithm, check sum data sets are created according to the section on the more recent algorithm, with a few deviations:
Meta-information for the stored datasets
Information about data sets that store hash values is stored as RDF data in the technical data description. Let's assume that a data set is represented by the resource adf://some_dataset. A resource representing the hash data set looks like the following:
<adf://some_arbitrary_uri> a adf-hash:HashDataSet;
    adf-audit:hasDigest [
        a adf-audit:Digest;
        adf-hash:checksumDatasetOf <adf://some_dataset>;
        adf-audit:hasDigestMethod [
            a adf-audit:DigestMethod;
            adf-hash:hashDimension [
                a adf-hash:HashDimensionDescription;
                hdf:order 1;
                adf-hash:hashBlockSize 1000;
            ];
            adf-hash:hashDimension [
                a adf-hash:HashDimensionDescription;
                hdf:order 2;
                adf-hash:hashBlockSize 500;
            ];
            adf-audit:hasDigestAlgorithm lc-hash:sha256;
        ]
    ] .
The check sum data set shown has two dimensions, where the number of elements to hash per block is 1000 in the first dimension and 500 in the second. The digest algorithm used to aggregate the data, SHA-256, is also specified.
Change History
Version | Release Date | Remarks |
---|---|---|
1.2.0 | 2016-12-07 | |
1.3.0 Preview | 2017-03-31 | |
1.3.0 RC | 2017-05-05 | |
1.3.0 RF | 2017-06-30 | |
1.4.2 | 2018-01-25 | |
1.4.3 RC | 2018-10-11 | |
1.4.5 RF | 2018-12-17 | |
1.5.0 RC | 2019-12-12 | |
1.5.0 RF | 2020-03-03 | |
1.5.3 RF | 2020-11-30 | |